{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2ab684ec",
   "metadata": {},
   "source": [
    "## Read CSV data into a pandas dataframe\n",
    "\n",
    "The data used in this notebook is [Kaggle 48K movies](https://www.kaggle.com/datasets/yashgupta24/48000-movies-dataset) which contains a lot of metadata in addition to the raw review text.\n",
    "\n",
    "Usually there is a data cleaning step, such as replacing nulls with empty strings or filling unusual and empty fields with median values.  Below, I'll simply drop rows with null values."
   ]
  },
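  {
   "cell_type": "markdown",
   "id": "d4c1ea01",
   "metadata": {},
   "source": [
    "As a minimal sketch of the trade-off (using a hypothetical toy frame, not the Kaggle data), compare filling a missing numeric field with the median versus dropping the row:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Toy frame with one missing rating.\n",
    "toy = pd.DataFrame({'Name': ['A', 'B', 'C'],\n",
    "                    'RatingValue': [7.0, np.nan, 5.0]})\n",
    "\n",
    "# Option 1: fill missing numeric fields with the median (keeps all rows).\n",
    "filled = toy['RatingValue'].fillna(toy['RatingValue'].median())\n",
    "print(filled.tolist())  # [7.0, 6.0, 5.0]\n",
    "\n",
    "# Option 2 (used below): drop rows with nulls entirely.\n",
    "print(toy.dropna(subset=['RatingValue']).shape[0])  # 2\n",
    "```\n",
    "\n",
    "Dropping is simplest when, as here, there is plenty of data to spare."
   ]
  },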
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "1b26d094",
   "metadata": {},
   "outputs": [],
   "source": [
    "# For Colab, install these libraries in this order:\n",
    "# !pip install numpy pandas torch pymilvus langchain transformers sentence-transformers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "12cc6627",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import common libraries.\n",
    "import sys, os, time, pprint\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Import custom functions for splitting and search.\n",
    "sys.path.append(\"..\")  # Adds higher directory to python modules path.\n",
    "import milvus_utilities as _utils"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ecab0d57",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "48513\n",
      "45036\n",
      "Example text length: 6556\n",
      "('Example text: Sallie Gardner at a Gallop Sallie Gardner at a Gallop is a '\n",
      " 'short starring Gilbert Domm and Sallie Gardner. The clip shows a jockey, '\n",
      " 'Domm, riding a horse, Sally Gardner. The clip is not filmed but instead '\n",
      " 'consists of 24 individual photographs shot in rapid... Sometimes ascribed as '\n",
      " '\"The Father of the Motion Picture\", Eadweard Muybridge undeniably '\n",
      " 'accomplished exploiting and sometimes introducing a means of instantaneous '\n",
      " 'and serial images to analyze and synthesize animal locomotion. In part, the '\n",
      " \"reasons for and the claims made of his work support Virgilio Tosi's thesis \"\n",
      " 'that cinema was invented out of the needs of scientific research. '\n",
      " \"Conversely, they're informed by Muybridge's background as an artistic \"\n",
      " 'location photographer and, as Phillip Prodger suggests, in book sales and '\n",
      " 'more useful to art than to science, as Marta Braun has demonstrated (see '\n",
      " 'sources at bottom). Additionally, Muybridge quickly exploited their '\n",
      " 'entertainment value via projection to audiences across the U.S. and Europe. '\n",
      " 'Muybridge pursued both of these paths of invention: the path taken by Jules '\n",
      " 'Janssen, Étienne-Jules Marey and others for science and the path taken by '\n",
      " 'Ottomar Anschütz, Thomas Edison, the Lumiére brothers and others for fame '\n",
      " 'and profit.\\n'\n",
      " '\\n'\n",
      " 'Muybridge began taking instantaneous single photographs of multi-millionaire '\n",
      " \"railroad magnate Leland Stanford's horses in motion in 1872. It was disputed \"\n",
      " \"at the time whether all four of a horse's legs were off the ground \"\n",
      " 'simultaneously at any time while running. Although no surviving photographs '\n",
      " 'prove it, contemporary lithographs and paintings likely based on the '\n",
      " 'photographs, indeed, show the moment of \"unsupported transit\". In between '\n",
      " 'and interrupting these experiments, Muybridge was found not guilty of the '\n",
      " \"admittedly premeditated fatal shooting of his wife's lover and possibly her \"\n",
      " \"son's father.\\n\"\n",
      " '\\n'\n",
      " \"Publication of Marey's graphic measurements of a horse's movements reignited \"\n",
      " \"Stanford's interest in the gait of horses. In turn, Marey was convinced to \"\n",
      " \"switch to photography in his motion studies after witnessing Muybridge's \"\n",
      " 'work (see \"Falling Cat\" (1894)). This work in \"automatic '\n",
      " 'electro-photographs\" began in 1878 at Stanford\\'s Palo Alto Stock Farm. '\n",
      " 'Multiple cameras were stored in a shed parallel to a track. A series of '\n",
      " 'closing boards serving as shutters were triggered by tripped threads and '\n",
      " 'electrical means. The wet collodion process of the time, reportedly, could '\n",
      " 'need up to half a minute for an exposure. For the split-second shutter '\n",
      " 'speeds required here, a white canvas background and powdered lime on the '\n",
      " 'track provided more contrast to compensate for less light getting to the '\n",
      " \"glass plates. Employees of Stanford's Central Pacific Railroad and others \"\n",
      " 'helped in constructing this \"set\" and camera equipment.\\n'\n",
      " '\\n'\n",
      " 'Contrary to unattributed claims on the web, this so-called \"Sallie Gardner '\n",
      " 'at a Gallop\" wasn\\'t the first series photographed by Muybridge. Six series '\n",
      " 'of Muybridge\\'s first subjects were published on cards entitled \"The Horse '\n",
      " 'in Motion\". The first is of the horse Abe Edgington trotting on 11 June '\n",
      " '1878. Reporters were invited for the next two series on June 15th, and, as '\n",
      " 'they reported, again, Abe went first—trotting and pulling the driver behind '\n",
      " 'in a sulky, which is what tripped the threads. The second subject that day '\n",
      " 'was Sallie Gardner running and, thus, the mare had to trip the threads. '\n",
      " 'Reporters noted how this spooked her and how that was reflected in the '\n",
      " 'negatives developed on the spot. As one article said, she \"gave a wild bound '\n",
      " 'in the air, breaking the saddle girth as she left the ground.\" Based on such '\n",
      " \"descriptions, it doesn't seem that this series exists anymore. The \"\n",
      " 'animations on the web that are actually of Sallie are dated June 19th on '\n",
      " '\"The Horse in Motion\" card. Many animations claimed to be Sallie on YouTube, '\n",
      " 'Wikipedia and elsewhere, as of this date, are actually of a mare named Annie '\n",
      " \"G. and were part of Muybridge's University of Pennsylvania work published in \"\n",
      " '1887, as the Library of Congress and other reliable sources have made clear. '\n",
      " \"The early Palo Alto photographs aren't as detailed and are closer to \"\n",
      " \"silhouettes. The 12 images of Gardner also include one where she's \"\n",
      " \"stationary. The Morse's Gallery pictures are entirely in silhouette, while \"\n",
      " 'the La Nature engravings of these same images show the rider in a white '\n",
      " 'shirt.\\n'\n",
      " '\\n'\n",
      " 'The shot of the horse stationary, as Braun points out, was added later and '\n",
      " 'is indicative of the artistic and un-scientific assemblages Muybridge made '\n",
      " 'of his images—with the intent of publication, including in his own books. '\n",
      " 'This was especially prominent in his Pennsylvania work, which included many '\n",
      " 'nude models that were surely useful for art. Muybridge influenced artists '\n",
      " 'from Realists like Thomas Eakins and Meissonier, Impressionists like Edgar '\n",
      " 'Degas and Frederick Remington, to the more abstract works of Francis Bacon. '\n",
      " 'His precedence has also been cited in the photography of Steven Pippin and '\n",
      " 'Hollis Frampton, as well as the bullet-time effects in \"The Matrix\" (1999).\\n'\n",
      " '\\n'\n",
      " 'Muybridge lectured on this relationship with art when touring with his '\n",
      " 'Zoöpraxiscope, which was a combination of the magic lantern and '\n",
      " 'phenakistoscope. With it, he projected, from glass disks, facsimiles of his '\n",
      " 'photographs hand-painted by Erwin Faber. Without intermittent movement, the '\n",
      " 'Zoöpraxiscope compressed the images, so elongated drawings were used instead '\n",
      " 'of photographs. Muybridge and others also used his images for '\n",
      " 'phenakistoscopes and zoetropes. The first demonstration of the Zoöpraxiscope '\n",
      " 'was to Stanford and friends in the autumn of 1879. A public demonstration '\n",
      " 'was given on 4 May 1880 for the San Francisco art association, and Muybridge '\n",
      " 'continued these lectures for years—personally touring the U.S. and Europe. '\n",
      " 'Although there were predecessors in animated projections as far back as 1847 '\n",
      " 'by Leopold Ludwig Döbler, in 1853 by Franz von Uchatius, and with posed '\n",
      " 'photographs by Henry Heyl in 1870, the chronophotographic and artistic basis '\n",
      " \"offered some novelty for Muybridge's presentations. They also led him to \"\n",
      " 'meet Edison and Marey and inspire the likes of Anschütz and others—those who '\n",
      " 'took the next steps in the invention of movies.\\n'\n",
      " '\\n'\n",
      " '(Main Sources: \"The Inventor and the Tycoon\" by Edward Ball. \"Eadweard '\n",
      " 'Muybridge\" and \"Picturing Time\" by Marta Braun. \"The Man Who Stopped Time\" '\n",
      " 'by Brian Clegg. \"Man in Motion\" by Robert Bartlett Haas. \"The Father of the '\n",
      " 'Motion Picture\" by Gordon Hendricks. \"The Stanford Years, 1872-1882\" edited '\n",
      " 'by Anita Ventura Mozley. \"Time Stands Still\" by Phillip Prodger. \"Cinema '\n",
      " 'Before Cinema\" by Virgilio Tosi.)')\n",
      "id               int64\n",
      "url             object\n",
      "Name            object\n",
      "PosterLink      object\n",
      "Genres          object\n",
      "Actors          object\n",
      "Director        object\n",
      "Keywords        object\n",
      "RatingValue    float32\n",
      "text            object\n",
      "MovieYear        int64\n",
      "dtype: object\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>url</th>\n",
       "      <th>Name</th>\n",
       "      <th>PosterLink</th>\n",
       "      <th>Genres</th>\n",
       "      <th>Actors</th>\n",
       "      <th>Director</th>\n",
       "      <th>Keywords</th>\n",
       "      <th>RatingValue</th>\n",
       "      <th>text</th>\n",
       "      <th>MovieYear</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>48232</th>\n",
       "      <td>48232</td>\n",
       "      <td>https://www.imdb.com/title/tt7044010/</td>\n",
       "      <td>Rory Scovel Tries Stand-Up for the First Time</td>\n",
       "      <td>https://m.media-amazon.com/images/M/MV5BZWUxM2...</td>\n",
       "      <td>[Documentary, Comedy]</td>\n",
       "      <td>[Rory Scovel, Jack White, Ben Swank, Ben Kronb...</td>\n",
       "      <td>Scott Moran</td>\n",
       "      <td>[stand up special, stand up comedy, tv special]</td>\n",
       "      <td>6.7</td>\n",
       "      <td>Rory Scovel Tries Stand-Up for the First Time ...</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48233</th>\n",
       "      <td>48233</td>\n",
       "      <td>https://www.imdb.com/title/tt7048622/</td>\n",
       "      <td>L'insulte</td>\n",
       "      <td>https://m.media-amazon.com/images/M/MV5BZTI1NG...</td>\n",
       "      <td>[Crime, Drama, Thriller]</td>\n",
       "      <td>[Adel Karam, Kamel El Basha, Camille Salameh, ...</td>\n",
       "      <td>Ziad Doueiri</td>\n",
       "      <td>[court case, courtroom drama, prejudice, insul...</td>\n",
       "      <td>7.7</td>\n",
       "      <td>L'insulte L'insulte is a movie starring Adel K...</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          id                                    url  \\\n",
       "48232  48232  https://www.imdb.com/title/tt7044010/   \n",
       "48233  48233  https://www.imdb.com/title/tt7048622/   \n",
       "\n",
       "                                                Name  \\\n",
       "48232  Rory Scovel Tries Stand-Up for the First Time   \n",
       "48233                                      L'insulte   \n",
       "\n",
       "                                              PosterLink  \\\n",
       "48232  https://m.media-amazon.com/images/M/MV5BZWUxM2...   \n",
       "48233  https://m.media-amazon.com/images/M/MV5BZTI1NG...   \n",
       "\n",
       "                         Genres  \\\n",
       "48232     [Documentary, Comedy]   \n",
       "48233  [Crime, Drama, Thriller]   \n",
       "\n",
       "                                                  Actors      Director  \\\n",
       "48232  [Rory Scovel, Jack White, Ben Swank, Ben Kronb...   Scott Moran   \n",
       "48233  [Adel Karam, Kamel El Basha, Camille Salameh, ...  Ziad Doueiri   \n",
       "\n",
       "                                                Keywords  RatingValue  \\\n",
       "48232    [stand up special, stand up comedy, tv special]          6.7   \n",
       "48233  [court case, courtroom drama, prejudice, insul...          7.7   \n",
       "\n",
       "                                                    text  MovieYear  \n",
       "48232  Rory Scovel Tries Stand-Up for the First Time ...       2017  \n",
       "48233  L'insulte L'insulte is a movie starring Adel K...       2017  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Import common libraries.\n",
    "import sys, os, time, pprint\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Import custom functions for splitting and search.\n",
    "sys.path.append(\"..\")  # Adds higher directory to python modules path.\n",
    "import milvus_utilities as _utils\n",
    "\n",
    "# Read CSV data.\n",
    "df = pd.read_csv(\"data/original_data.csv\")\n",
    "\n",
    "# Concatenate 'Name', 'Description', and 'ReviewBody' into a new 'text' column.\n",
    "df['text'] = df['Name'] + ' ' + df['Description'] + ' ' + df['ReviewBody']\n",
    "# drop rows without any text.\n",
    "print(df.shape[0])\n",
    "df = df.dropna(subset=['text'])\n",
    "print(df.shape[0])\n",
    "\n",
    "# Convert genres from string with commas in it to list of strings.\n",
    "df['Genres'] = df['Genres'].str.split(',')\n",
    "df['Genres'] = df['Genres'].apply(lambda d: d if isinstance(d, list) else [\"\"])\n",
    "\n",
    "# Convert actors from string with commas in it to list of strings.\n",
    "df['Actors'] = df['Actors'].str.split(',')\n",
    "df['Actors'] = df['Actors'].apply(lambda d: d if isinstance(d, list) else [\"\"])\n",
    "\n",
    "# Convert keywords from string with commas in it to list of strings.\n",
    "df['Keywords'] = df['Keywords'].str.split(',')\n",
    "df['Keywords'] = df['Keywords'].apply(lambda d: d if isinstance(d, list) else [\"\"])\n",
    "\n",
    "# Extract out just the year from the date.\n",
    "def extract_year(movie_date):\n",
    "    try:\n",
    "        return int(movie_date.split('-')[0])\n",
    "    except Exception:\n",
    "        return -1  # return -1 instead of None\n",
    "df['MovieYear'] = df.DatePublished.apply(extract_year)\n",
    "\n",
    "# Convert 'Rating' to float.\n",
    "df['RatingValue'] = pd.to_numeric(df['RatingValue'], errors='coerce')\n",
    "df['RatingValue'] = df['RatingValue'].fillna(-1).astype('float32')\n",
    "\n",
    "# Drop extra rating columns.\n",
    "df.drop(columns=['RatingCount', 'BestRating', 'WorstRating',\n",
    "                 'ReviewAurthor', 'ReviewDate', 'ReviewBody',\n",
    "                 'Description', 'duration', 'DatePublished'], inplace=True)\n",
    "\n",
    "# Inspect text.\n",
    "print(f\"Example text length: {len(df.text[0])}\")\n",
    "pprint.pprint(f\"Example text: {df.text[0]}\")\n",
    "\n",
    "# Take a small subset of the data for faster testing.\n",
    "df = df.tail(250).copy()\n",
    "\n",
    "print(df.dtypes)\n",
    "display(df.head(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a19e525",
   "metadata": {},
   "source": [
    "## Connect using Milvus Lite\n",
    "\n",
    "Milvus Lite is a lightweight, local version of Milvus that runs inside your Python process.  It's ideal for getting started with Milvus on a laptop, in a Jupyter notebook, or on Colab.\n",
    "\n",
    "⛔️ Please note Milvus Lite is only meant for demos, not for production workloads.\n",
    "\n",
    "- [github](https://github.com/milvus-io/milvus-lite)\n",
    "- [documentation](https://milvus.io/docs/quickstart.md)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "80de564c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pymilvus:2.4.3\n"
     ]
    }
   ],
   "source": [
    "# STEP 1. CONNECT TO MILVUS LITE.\n",
    "\n",
    "# !python -m pip install -U --no-cache-dir pymilvus\n",
    "import pymilvus\n",
    "print(f\"pymilvus:{pymilvus.__version__}\")\n",
    "\n",
    "# Connect a client to the Milvus Lite server.\n",
    "from pymilvus import MilvusClient\n",
    "mc = MilvusClient(\"milvus_demo.db\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "963eacb3",
   "metadata": {},
   "source": [
    "# Optional - Connect using Milvus standalone Docker\n",
    "\n",
    "This section uses [Milvus standalone](https://milvus.io/docs/configure-docker.md) on Docker. <br>\n",
    ">⛔️ Make sure you pip install the correct version of pymilvus and server yml file.  **Versions (major and minor) should all match**.\n",
    "\n",
    "1. [Install Docker](https://docs.docker.com/get-docker/)\n",
    "2. Start your Docker Desktop\n",
    "3. Download the latest [docker-compose.yml](https://milvus.io/docs/install_standalone-docker.md#Download-the-YAML-file) (or run the wget command below, replacing the version with the one you are using)\n",
    "> wget https://github.com/milvus-io/milvus/releases/download/v2.4.0-rc.1/milvus-standalone-docker-compose.yml -O docker-compose.yml\n",
    "4. From your terminal:  \n",
    "   - cd into the directory where you saved the .yml file (usually the same dir as this notebook)\n",
    "   - docker compose up -d\n",
    "   - verify (either in terminal or on Docker Desktop) the containers are running\n",
    "5. From your code (see notebook code below):\n",
    "   - Import milvus\n",
    "   - Connect to the local milvus server"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ba692141",
   "metadata": {},
   "outputs": [],
   "source": [
    "# # STEP 1. CONNECT TO MILVUS STANDALONE DOCKER.\n",
    "# import pymilvus\n",
    "\n",
    "# # print(f\"Pymilvus: {pymilvus.__version__}\") #2.4.3\n",
    "# # !wget https://github.com/milvus-io/milvus/releases/download/v2.4.1/milvus-standalone-docker-compose.yml -O docker-compose.yml\n",
    "\n",
    "# import pymilvus, time\n",
    "# from pymilvus import (MilvusClient, utility, connections)\n",
    "# print(f\"Pymilvus: {pymilvus.__version__}\")\n",
    "\n",
    "# # Start the Milvus server.\n",
    "# # !docker compose up -d\n",
    "\n",
    "# # Connect to the local server.\n",
    "# connection = connections.connect(\n",
    "#   alias=\"default\", \n",
    "#   host='localhost', # or '0.0.0.0' or 'localhost'\n",
    "#   port='19530'\n",
    "# )\n",
    "\n",
    "# # Get server version.\n",
    "# print(f\"Milvus server: {utility.get_server_version()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b01d6622",
   "metadata": {},
   "source": [
    "## Load the Embedding Model checkpoint and use it to create vector embeddings\n",
    "**Embedding model:**  We will use an open-source [sentence transformer](https://www.sbert.net/docs/pretrained_models.html) from HuggingFace to encode the review text.  We will download the model from HuggingFace and run it locally.\n",
    "\n",
    "💡Tip:  A good way to choose a sentence transformer model is to check the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).  Sort descending by column \"Retrieval Average\" and choose the best-performing small model.\n",
    "\n",
    "Two model parameters of note below:\n",
    "1. EMBEDDING_DIM refers to the dimensionality, or length, of the embedding vector. In this case, every embedding the model generates has the SAME length = 1024. This size of embedding is typical of large BERT-based models, whose embeddings are used for downstream tasks such as classification, question answering, or retrieval. <br><br>\n",
    "2. MAX_SEQ_LENGTH is the maximum context length the encoder model can handle for input sequences. In this case, if a sequence longer than 512 tokens is given to the model, everything past 512 tokens will be (silently!) chopped off.  This is why a chunking strategy is needed: it segments input texts into chunks whose lengths fit in the model's input."
   ]
  },
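  {
   "cell_type": "markdown",
   "id": "e5b2fa02",
   "metadata": {},
   "source": [
    "To see why chunking matters, here is a rough sketch of how many chunks a long review needs.  The 4-characters-per-token figure is a rule-of-thumb assumption, not a tokenizer:\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "MAX_SEQ_LENGTH = 512                  # tokens, from the model\n",
    "OVERLAP = int(0.10 * MAX_SEQ_LENGTH)  # 10% overlap = 51 tokens\n",
    "\n",
    "def num_chunks(num_tokens, size=MAX_SEQ_LENGTH, overlap=OVERLAP):\n",
    "    # Each chunk after the first contributes (size - overlap) new tokens.\n",
    "    if num_tokens <= size:\n",
    "        return 1\n",
    "    return 1 + math.ceil((num_tokens - size) / (size - overlap))\n",
    "\n",
    "# A 6556-character review is roughly 6556 // 4 = 1639 tokens.\n",
    "print(num_chunks(6556 // 4))  # 4\n",
    "```\n",
    "\n",
    "Without chunking, roughly two thirds of that example review would be silently truncated."
   ]
  },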
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "abe46dbf",
   "metadata": {},
   "outputs": [],
   "source": [
    "# !python -m pip install -U sentence-transformers transformers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "dd2be7fd",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/miniconda3/envs/py311-unum/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "model_name: BAAI/bge-large-en-v1.5\n",
      "EMBEDDING_DIM: 1024\n",
      "MAX_SEQ_LENGTH: 512\n"
     ]
    }
   ],
   "source": [
    "# STEP 2. DOWNLOAD AN OPEN SOURCE EMBEDDING MODEL.\n",
    "\n",
    "# Import torch.\n",
    "import torch\n",
    "from sentence_transformers import SentenceTransformer\n",
    "\n",
    "# Initialize torch settings\n",
    "torch.backends.cudnn.deterministic = True\n",
    "DEVICE = torch.device('cuda:3' if torch.cuda.is_available() else 'cpu')\n",
    "\n",
    "# Load the model from huggingface model hub.\n",
    "model_name = \"BAAI/bge-large-en-v1.5\"\n",
    "encoder = SentenceTransformer(model_name, device=DEVICE)\n",
    "\n",
    "# Get the model parameters and save for later.\n",
    "EMBEDDING_DIM = encoder.get_sentence_embedding_dimension()\n",
    "MAX_SEQ_LENGTH_IN_TOKENS = encoder.get_max_seq_length() \n",
    "# Use the model's token limit directly as the sequence length.\n",
    "MAX_SEQ_LENGTH = MAX_SEQ_LENGTH_IN_TOKENS\n",
    "EOS_TOKEN_LENGTH = 1\n",
    "\n",
    "# Inspect model parameters.\n",
    "print(f\"model_name: {model_name}\")\n",
    "print(f\"EMBEDDING_DIM: {EMBEDDING_DIM}\")\n",
    "print(f\"MAX_SEQ_LENGTH: {MAX_SEQ_LENGTH}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a Milvus collection\n",
    "\n",
    "You can think of a collection in Milvus like a \"table\" in SQL databases.  The **collection** will contain the \n",
    "- **Schema** (or [no-schema Milvus client](https://milvus.io/docs/using_milvusclient.md)).  \n",
    "💡 You'll need the vector `EMBEDDING_DIM` parameter from your embedding model.\n",
    "Typical values are:\n",
    "   - 1024 for sbert embedding models\n",
    "   - 1536 for ada-002 OpenAI embedding models\n",
    "- **Vector index** for efficient vector search\n",
    "- **Vector distance metric** for measuring nearest neighbor vectors\n",
    "- **Consistency level**.  In Milvus, transactional consistency is possible; however, according to the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), some latency must be sacrificed. 💡 Searching movie reviews is not mission-critical, so [`eventually`](https://milvus.io/docs/consistency.md) consistent is fine here.\n",
    "\n",
    "## Add a Vector Index\n",
    "\n",
    "The vector index determines the vector **search algorithm** used to find the closest vectors in your data to the query a user submits.  \n",
    "\n",
    "Most vector indexes use different sets of parameters depending on whether the database is:\n",
    "- **inserting vectors** (creation mode) - vs - \n",
    "- **searching vectors** (search mode) \n",
    "\n",
    "Scroll down the [docs page](https://milvus.io/docs/index.md) to see a table listing different vector indexes available on Milvus.  For example:\n",
    "- FLAT - deterministic exhaustive search\n",
    "- IVF_FLAT or IVF_SQ8 - Inverted-file (clustering-based) index (stochastic approximate search)\n",
    "- HNSW - Graph index (stochastic approximate search)\n",
    "- AUTOINDEX - Automatically determined based on OSS vs [Zilliz cloud](https://docs.zilliz.com/docs/autoindex-explained), type of GPU, size of data.\n",
    "\n",
    "Besides a search algorithm, we also need to specify a **distance metric**, that is, a definition of what is considered \"close\" in vector space.  In the cell below, the [`HNSW`](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md) search index is chosen.  Its possible distance metrics are one of:\n",
    "- L2 - Euclidean distance\n",
    "- IP - Inner (dot) product\n",
    "- COSINE - Cosine (angular) similarity\n",
    "\n",
    "💡 Most use cases work better with normalized embeddings.  For unit-length vectors, IP and COSINE are identical, and L2 produces the same ranking (L2 distance is a monotonic function of cosine similarity).  L2 only behaves differently if you keep your embeddings unnormalized."
   ]
  },
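  {
   "cell_type": "markdown",
   "id": "f6c3ab03",
   "metadata": {},
   "source": [
    "A quick numpy check of the metric claim above: for unit-length vectors, IP equals COSINE, and squared L2 distance is a simple function of cosine similarity (||a - b||^2 = 2 - 2cos), so all three rank neighbors the same way:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "a, b = rng.normal(size=8), rng.normal(size=8)\n",
    "a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)\n",
    "\n",
    "ip = float(a @ b)                    # inner product == cosine for unit vectors\n",
    "l2_sq = float(((a - b) ** 2).sum())  # squared L2 distance\n",
    "\n",
    "assert abs(l2_sq - (2 - 2 * ip)) < 1e-9\n",
    "```"
   ]
  },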
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Successfully dropped collection: `IMDB_metadata`\n"
     ]
    }
   ],
   "source": [
    "# STEP 3. CREATE A NO-SCHEMA MILVUS COLLECTION AND DEFINE THE DATABASE INDEX.\n",
    "\n",
    "# Create a collection.\n",
    "COLLECTION_NAME = \"IMDB_metadata\"\n",
    "\n",
    "# Milvus Lite uses the MilvusClient object.\n",
    "if mc.has_collection(COLLECTION_NAME):\n",
    "    mc.drop_collection(COLLECTION_NAME)\n",
    "    print(f\"Successfully dropped collection: `{COLLECTION_NAME}`\")\n",
    "\n",
    "# # Check if collection already exists, if so drop it.\n",
    "# has = utility.has_collection(COLLECTION_NAME)\n",
    "# if has:\n",
    "#     drop_result = utility.drop_collection(COLLECTION_NAME)\n",
    "#     print(f\"Successfully dropped collection: `{COLLECTION_NAME}`\")\n",
    "# # Use no-schema Milvus client uses flexible json key:value format.\n",
    "# mc = MilvusClient(connections=connection)\n",
    "\n",
    "# Create a collection with flexible schema and AUTOINDEX.\n",
    "mc.create_collection(COLLECTION_NAME, \n",
    "                     EMBEDDING_DIM,\n",
    "                     consistency_level=\"Eventually\", \n",
    "                     auto_id=True,  \n",
    "                     overwrite=True,\n",
    "                    )\n",
    "print(f\"Successfully created collection: `{COLLECTION_NAME}`\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c60423a5",
   "metadata": {},
   "source": [
    "## Simple Chunking\n",
    "\n",
    "Before embedding, it is necessary to decide your chunk strategy, chunk size, and chunk overlap.  This section uses:\n",
    "- **Strategy** = Simple fixed chunk lengths.\n",
    "- **Chunk size** = Use the embedding model's parameter `MAX_SEQ_LENGTH`\n",
    "- **Overlap** = Rule-of-thumb 10-15%\n",
    "- **Function** = Langchain's `RecursiveCharacterTextSplitter` to split up long reviews recursively."
   ]
  },
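  {
   "cell_type": "markdown",
   "id": "a7d4bc04",
   "metadata": {},
   "source": [
    "The size/overlap arithmetic can be illustrated with a plain fixed-window chunker (a simplified sketch; langchain's `RecursiveCharacterTextSplitter`, used below, additionally prefers natural boundaries like paragraphs and sentences):\n",
    "\n",
    "```python\n",
    "def fixed_chunks(text, chunk_size=512, overlap_pct=0.10):\n",
    "    # Slide a window of chunk_size chars, stepping by size minus overlap.\n",
    "    step = chunk_size - int(chunk_size * overlap_pct)\n",
    "    chunks = []\n",
    "    for start in range(0, len(text), step):\n",
    "        chunks.append(text[start:start + chunk_size])\n",
    "        if start + chunk_size >= len(text):\n",
    "            break\n",
    "    return chunks\n",
    "\n",
    "print([len(c) for c in fixed_chunks('x' * 1200)])  # [512, 512, 278]\n",
    "```"
   ]
  },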
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6ee61e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# STEP 4. PREPARE DATA: CHUNK AND EMBED\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "def recursive_splitter_wrapper(text, chunk_size):\n",
    "\n",
    "    # Default chunk overlap is 10% chunk_size.\n",
    "    chunk_overlap = np.round(chunk_size * 0.10, 0)\n",
    "\n",
    "    # Use langchain's convenient recursive chunking method.\n",
    "    text_splitter = RecursiveCharacterTextSplitter(\n",
    "        chunk_size=chunk_size,\n",
    "        chunk_overlap=chunk_overlap,\n",
    "        length_function=len,\n",
    "    )\n",
    "    chunks = text_splitter.split_text(text)\n",
    "\n",
    "    # Replace HTML line breaks with spaces.\n",
    "    chunks = [chunk.replace(\"<br /><br />\", \" \") for chunk in chunks]\n",
    "\n",
    "    return chunks\n",
    "\n",
    "# Use recursive splitter to chunk text.\n",
    "def imdb_chunk_text(batch_size, df, chunk_size):\n",
    "\n",
    "    batch = df.head(batch_size).copy()\n",
    "    print(f\"chunk size: {chunk_size}\")\n",
    "    print(f\"original shape: {batch.shape}\")\n",
    "    \n",
    "    start_time = time.time()\n",
    "\n",
    "    # 1. Chunk the text review into chunk_size.\n",
    "    batch['chunk'] = batch['text'].apply(recursive_splitter_wrapper, chunk_size=chunk_size)\n",
    "    # Explode the 'chunk' column to create new rows for each chunk.\n",
    "    batch = batch.explode('chunk', ignore_index=True)\n",
    "    print(f\"new shape: {batch.shape}\")\n",
    "\n",
    "    # 2. Add embeddings as new column in df.\n",
    "    embeddings = torch.tensor(encoder.encode(batch['chunk']))\n",
    "    # Normalize each embedding row to unit length.\n",
    "    embeddings = np.array(embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True))\n",
    "\n",
    "    # 3. Convert embeddings to list of `numpy.ndarray`, each containing `numpy.float32` numbers.\n",
    "    converted_values = list(map(np.float32, embeddings))\n",
    "    batch['vector'] = converted_values\n",
    "\n",
    "    end_time = time.time()\n",
    "    print(f\"Chunking + embedding time for {batch_size} docs: {end_time - start_time} sec\")\n",
    "    # Inspect the batch of data.\n",
    "    assert len(batch.chunk[0]) <= MAX_SEQ_LENGTH-1\n",
    "    assert len(batch.vector[0]) == EMBEDDING_DIM\n",
    "    print(f\"type embeddings: {type(batch.vector)} of {type(batch.vector[0])}\")\n",
    "    print(f\"of numbers: {type(batch.vector[0][0])}\")\n",
    "\n",
    "    return batch"
   ]
  },
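  {
   "cell_type": "markdown",
   "id": "norm-demo-md",
   "metadata": {},
   "source": [
    "A minimal sketch of the per-row normalization used above, on a toy numpy matrix (the values are made up and stand in for real embeddings): dividing each row by its own L2 norm yields unit-length rows, which is what cosine/inner-product search expects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "norm-demo-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy 3x4 matrix standing in for a batch of embeddings.\n",
    "toy = np.arange(12, dtype=np.float32).reshape(3, 4) + 1.0\n",
    "# axis=1, keepdims=True divides each row by its own norm.\n",
    "normalized = toy / np.linalg.norm(toy, axis=1, keepdims=True)\n",
    "print(np.linalg.norm(normalized, axis=1))  # each row norm should be ~1.0"
   ]
  },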
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01a2a240",
   "metadata": {},
   "outputs": [],
   "source": [
    "## Chunk and Embed Text Data\n",
    "\n",
    "# Use the embedding model parameters.\n",
    "# chunk_size = MAX_SEQ_LENGTH - HF_EOS_TOKEN_LENGTH\n",
    "chunk_size = 512\n",
    "# Chunk overlap (10% of chunk_size) is set inside recursive_splitter_wrapper.\n",
    "\n",
    "# Chunk a batch of data from pandas DataFrame and inspect it.\n",
    "BATCH_SIZE = 800\n",
    "batch = imdb_chunk_text(BATCH_SIZE, df, chunk_size)\n",
    "display(batch.head(2))\n",
    "\n",
    "# Drop the original text column, keep the new 'chunk' column.\n",
    "batch.drop(columns=['id','text'], axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9bd8153",
   "metadata": {},
   "source": [
    "## Insert data into Milvus\n",
    "\n",
    "For each chunk, we'll write its fields (`vector`, `chunk`, and the movie metadata) into the database.\n",
    "\n",
    "<div>\n",
    "<img src=\"../../images/db_insert.png\" width=\"80%\"/>\n",
    "</div>\n",
    "\n",
    "**The Milvus Client wrapper can only load data from a list of dictionaries.**\n",
    "\n",
    "Otherwise, in general, Milvus supports loading data from:\n",
    "- pandas dataframes\n",
    "- lists of dictionaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28b7face",
   "metadata": {},
   "outputs": [],
   "source": [
    "# STEP 6. INSERT CHUNK LIST INTO MILVUS OR ZILLIZ.\n",
    "\n",
    "# Convert the DataFrame to a list of dictionaries\n",
    "dict_list = batch.to_dict(orient='records')\n",
    "\n",
    "# Insert data into the Milvus collection.\n",
    "print(\"Start inserting entities\")\n",
    "start_time = time.time()\n",
    "insert_result = mc.insert(\n",
    "    COLLECTION_NAME,\n",
    "    data=dict_list,\n",
    "    progress_bar=True)\n",
    "end_time = time.time()\n",
    "print(f\"Milvus Client insert time for {batch.shape[0]} vectors: {end_time - start_time} seconds\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2cb04eba",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect the first chunk and its metadata.\n",
    "print(len(dict_list))\n",
    "print(type(dict_list[0]), len(dict_list[0]))\n",
    "pprint.pprint(dict_list[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c022c38a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define metadata fields you can filter on.\n",
    "OUTPUT_FIELDS = list(dict_list[0].keys())\n",
    "# drop vector field\n",
    "OUTPUT_FIELDS.remove('vector')\n",
    "OUTPUT_FIELDS"
   ]
  },
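  {
   "cell_type": "markdown",
   "id": "filter-expr-md",
   "metadata": {},
   "source": [
    "As a small sketch (assuming the collection has the `rating` and `genres` fields shown above), a Milvus boolean filter expression can be composed from individual clauses before passing it to search:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "filter-expr-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Join clauses with the Milvus boolean AND operator.\n",
    "clauses = ['rating >= 6.5', 'array_contains_any(genres, [\"Sci-Fi\"])']\n",
    "print(' && '.join(clauses))"
   ]
  },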
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "129bc5bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# List distinct genres\n",
    "GENRES = list(set([genre for genres in df['Genres'] for genre in genres]))\n",
    "GENRES"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e84931bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot histogram of rating values.\n",
    "import matplotlib.pyplot as plt\n",
    "plt.figure(figsize=(4, 2))\n",
    "df['RatingValue'].hist();\n",
    "# Scale of 1-10.  Popular movies are rated >= 7."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02c589ff",
   "metadata": {},
   "source": [
    "## Ask a question about your data\n",
    "\n",
    "So far in this demo notebook: \n",
    "1. Your custom data has been mapped into a vector embedding space\n",
    "2. Those vector embeddings have been saved into a vector database\n",
    "\n",
    "Next, you can ask a question about your custom data!\n",
    "\n",
    "💡 In LLM vocabulary:\n",
    "> **Query** is the generic term for user input to a vector search.  \n",
    "A query can batch many individual questions at once.\n",
    "\n",
    "> **Question** usually refers to a single user question.  \n",
    "In the example below, the user question is \"Dystopia science fiction with a robot.\"\n",
    "\n",
    "> **Semantic Search** = a fast search of the entire knowledge base to find the `TOP_K` chunks whose embeddings are closest to the user's query.\n",
    "\n",
    "💡 For consistency, always use the same embedding model for the stored data and the query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e7f41f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define a sample question about your data.\n",
    "\n",
    "# This question is for the movies CSV dataset.\n",
    "SAMPLE_QUESTION = \"Dystopia science fiction with a robot.\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ea29411",
   "metadata": {},
   "source": [
    "## Execute a vector search\n",
    "\n",
    "Search Milvus using [PyMilvus API](https://milvus.io/docs/search.md).\n",
    "\n",
    "💡 By their nature, vector searches are \"semantic\" searches.  For example, if you were to search for \"leaky faucet\": \n",
    "> **Traditional keyword search** - one or both of the words \"leaky\" and \"faucet\" would have to literally match some text in a document for it to be returned.\n",
    "\n",
    "> **Semantic search** - results containing words like \"drippy\" and \"tap\" would also be returned, because they mean the same thing even though the words differ."
   ]
  },
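  {
   "cell_type": "markdown",
   "id": "cos-sim-md",
   "metadata": {},
   "source": [
    "The idea behind semantic search can be sketched with two toy unit vectors (hypothetical stand-ins for the embeddings of \"leaky faucet\" and \"drippy tap\"): with normalized embeddings, cosine similarity reduces to a dot product, and similar meanings score close to 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cos-sim-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Two toy 2-d unit vectors; real embeddings would come from the encoder.\n",
    "a = np.array([0.6, 0.8], dtype=np.float32)\n",
    "b = np.array([0.8, 0.6], dtype=np.float32)\n",
    "# Dot product of unit vectors == cosine similarity.\n",
    "print(float(a @ b))  # close to 1.0 -> semantically similar"
   ]
  },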
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9673ce4a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define a convenience function for searching.\n",
    "def mc_run_search(question, filter_expression, top_k):\n",
    "    # Embed the question using the same encoder.\n",
    "    query_embeddings = _utils.embed_query(encoder, [question])\n",
    "\n",
    "    # Run semantic vector search using your query and the vector database.\n",
    "    results = mc.search(\n",
    "        COLLECTION_NAME,\n",
    "        data=query_embeddings, \n",
    "        # search_params=SEARCH_PARAMS,\n",
    "        output_fields=OUTPUT_FIELDS, \n",
    "        # Milvus can utilize metadata in boolean expressions to filter search.\n",
    "        filter=filter_expression,\n",
    "        # expr=filter_expression,\n",
    "        limit=top_k,\n",
    "        consistency_level=\"Eventually\"\n",
    "    )\n",
    "\n",
    "    # Assemble retrieved context and context metadata.\n",
    "    # The search result is in the variable `results[0]`, which is type \n",
    "    # 'pymilvus.orm.search.SearchResult'. \n",
    "    METADATA_FIELDS = [f for f in OUTPUT_FIELDS if f != 'chunk']\n",
    "    formatted_results, context, context_metadata = _utils.client_assemble_retrieved_context(\n",
    "        results, metadata_fields=METADATA_FIELDS, num_shot_answers=TOP_K)\n",
    "    \n",
    "    return formatted_results, context, context_metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e25ccac6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# STEP 7. RETRIEVE ANSWERS FROM YOUR DOCUMENTS STORED IN MILVUS OR ZILLIZ.\n",
    "\n",
    "TOP_K = 3\n",
    "# Metadata filters for the movies CSV dataset.\n",
    "expression = 'rating >= 6.5'\n",
    "# expression = 'film_year > 1900'\n",
    "# Infix string match.\n",
    "# expression = expression + ' && chunk like \"%fiction%\"'\n",
    "\n",
    "# Exact array element with prefix or infix match.\n",
    "expression = expression + ' && (genres[0] like \"Sci-Fi%\" || genres[1] like \"%Sci-Fi%\" || genres[2] == \"Sci-Fi\")'\n",
    "# expression = expression + ' && (keywords[0] like \"dystop%\" || keywords[1] like \"%dystop%\" || keywords[2] like \"%dystop%\")'\n",
    "# Array contains a value.\n",
    "# https://milvus.io/docs/array_data_type.md#Advanced-filtering\n",
    "# expression = 'array_contains_any(genres, [\"Sci-Fi\"])'\n",
    "# expression = 'array_contains_any(keywords, [\"knife\"])'\n",
    "# expression = 'keywords in [\"knife\"]'\n",
    "\n",
    "print(f\"filter: {expression}\")\n",
    "\n",
    "start_time = time.time()\n",
    "formatted_results, contexts, context_metadata = \\\n",
    "    mc_run_search(SAMPLE_QUESTION, expression, TOP_K)\n",
    "elapsed_time = time.time() - start_time\n",
    "print(f\"Milvus Client search time for {len(dict_list)} vectors: {elapsed_time} seconds\")\n",
    "\n",
    "# Inspect search result.\n",
    "print(f\"type: {type(formatted_results)}, count: {len(formatted_results)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb53d3cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display poster link.\n",
    "from IPython.display import Image\n",
    "from IPython.display import display\n",
    "\n",
    "# Loop through recommended movies, display poster, print metadata.\n",
    "seen_movies = []\n",
    "for i in range(len(contexts)):\n",
    "    print(f\"Retrieved result #{i+1}\")\n",
    "    print(f\"distance = {formatted_results[i][0]}\")\n",
    "    # Get the movie_index\n",
    "    movie_index = context_metadata[i]['movie_index']\n",
    "    print(f\"movie_index: {movie_index}\")\n",
    "\n",
    "    # Don't display the same movie_index twice.\n",
    "    if movie_index in seen_movies:\n",
    "        continue\n",
    "    else:\n",
    "        seen_movies.append(movie_index)\n",
    "        # Display the first poster link as a rendered image\n",
    "        x = Image(url = context_metadata[i]['poster_url'], width=150, height=200) \n",
    "        display(x)\n",
    "\n",
    "        # Print the rest of the movie info.\n",
    "        pprint.pprint(f\"chunk: {contexts[i]}\")\n",
    "        # print metadata except the movie_index and poster link.\n",
    "        for key, value in context_metadata[i].items():\n",
    "            if ((key != 'poster_url') and (key != 'movie_index')):\n",
    "                print(f\"{key}: {value}\")\n",
    "        print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6294947f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# # Drop the collection.\n",
    "# mc.drop_collection(COLLECTION_NAME)\n",
    "# print(f\"Successfully dropped collection: `{COLLECTION_NAME}`\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c777937e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Props to Sebastian Raschka for this handy watermark.\n",
    "# !pip install watermark\n",
    "\n",
    "%load_ext watermark\n",
    "%watermark -a 'Christy Bergman' -v -p torch,transformers,sentence_transformers,pymilvus,langchain --conda"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
